
IN PROGRESS general dev contributions. Currently tracking: #17

Merged: 62 commits, Jun 24, 2024

Conversation

mmcdermott (Owner) commented Jun 13, 2024

This replaces #12, which I will close momentarily.

  • Get a working pre-processing ETL example for conditionally independent, unique-timepoint point-process transformers (a.k.a. ESGPT-style models)
  • Track general metadata about a MEDS dataframe.

Summary by CodeRabbit

  • New Features

    • Introduced a new stage for collecting code metadata, enhancing data insights with detailed aggregations.
    • Added new preprocessing scripts for filtering codes, outliers, patients, and tokenization of patient data, improving data handling and analysis.
  • Refactor

    • Restructured pipeline stages in preprocessing configurations for better clarity and efficiency.
    • Updated function signatures and parameters in multiple scripts for consistent handling of configurations and data processing.
  • Bug Fixes

    • Corrected directory paths and file handling in preprocessing scripts, ensuring proper data flow and output generation.

coderabbitai bot (Contributor) commented Jun 13, 2024

Walkthrough

The changes integrate new stages and refine the configuration pipelines for the preprocessing and extraction scripts. They introduce metadata collection, data normalization, and tokenization processes; restructure configurations; and add filtering mechanisms for patients, codes, and outliers. The new functionality enhances the data-processing pipeline with better handling of code metadata and preparatory tasks for subsequent data analysis.

Changes

File: change summary

configs/extraction.yaml: Added collect_code_metadata stage to compute code metadata.
configs/preprocess.yaml: Reorganized pipeline stages; updated aggregation and summarization settings; added new stage configurations.
scripts/extraction/collect_code_metadata.py: Introduced script to compute code metadata using mapper and reducer functions.
scripts/preprocessing/filter_codes.py: New script for filtering patient data based on codes, using Polars.
scripts/preprocessing/filter_outliers.py: New script to filter outliers in data using Polars.
scripts/preprocessing/filter_patients.py: Updated input/output directory paths.
scripts/preprocessing/fit_vocabulary_indices.py: New script to assign vocabulary indices to code metadata.
scripts/preprocessing/normalize.py: New script for data normalization using Polars.
scripts/preprocessing/tensorize.py: New script for tensorizing data into specific formats.
scripts/preprocessing/tokenization.py: New script for tokenizing patient data stored in Parquet format.
src/MEDS_polars_functions/code_metadata.py: Updated mapper and reducer functions for code metadata; refined validation logic.
src/MEDS_polars_functions/filter_measurements.py: Introduced functions for filtering codes and outliers.
src/MEDS_polars_functions/get_vocabulary.py: New helper functions for normalizing a MEDS dataset and assigning vocabulary indices.
src/MEDS_polars_functions/normalization.py: Updated normalization function parameters and schema for code metadata.
src/MEDS_polars_functions/tensorize.py: New function to convert a tokenized dataframe into nested ragged tensor formats.
src/MEDS_polars_functions/tokenize.py: Introduced functions for tokenizing MEDS datasets into temporal sequences for deep learning.
tests/test_extraction.py: Updated test data for subjects and admit vitals.

Poem

In data's waltz, we step anew,
With stages fresh and codes we cue,
Metadata sings, tokens align,
Filters and tensors in a perfect line.
Normalize, summarize—oh what a flow,
Through Polars in the data's glow.
A dance of bytes, a rabbit's cheer,
For changes bright, the future's clear.

(_/)
( •,•)
(")_(")



…s will also facilitate joining with ontologies and downstream usage (and it tests that more general, pre-processing-oriented utility).
Added a code metadata extraction step for core MEDS extraction.
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Outside diff range and nitpick comments (3)
tests/test_extraction.py (3)

437-437: Add more comprehensive testing coverage.

The warning message indicates that only the 'train/0' split has been checked. It would be beneficial to extend this coverage to other splits as well to ensure thorough testing of all scenarios.
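
One way to address this, sketched under the assumption that the expected outputs are keyed by shard name (as MEDS_OUTPUTS is elsewhere in this test file):

import polars as pl
from polars.testing import assert_frame_equal

# Hypothetical helper: validate every output shard, not just "train/0".
def check_all_output_shards(output_dir, expected_outputs):
    for shard_name, want_df in expected_outputs.items():
        shard_fp = output_dir / f"{shard_name}.parquet"
        assert shard_fp.is_file(), f"Missing output shard: {shard_fp}"
        assert_frame_equal(pl.read_parquet(shard_fp), want_df, check_row_order=False)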


454-454: Consider verifying the contents of code metadata.

The logger warning indicates that the contents of the code metadata were not checked. It might be beneficial to add a test or check to ensure the accuracy and integrity of the metadata contents.


Line range hint 330-330: Use key in dict instead of key in dict.keys().

This is a minor performance enhancement. Using key in dict is more Pythonic and slightly more efficient than using key in dict.keys().

- if key in dict.keys():
+ if key in dict:
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed between 70352b6 and 401fbbe, relative to the base of the PR.

Files selected for processing (11)
  • configs/extraction.yaml (4 hunks)
  • configs/preprocess.yaml (1 hunks)
  • scripts/extraction/collect_code_metadata.py (1 hunks)
  • scripts/preprocessing/add_time_derived_measurements.py (3 hunks)
  • scripts/preprocessing/collect_code_metadata.py (3 hunks)
  • scripts/preprocessing/filter_codes.py (1 hunks)
  • scripts/preprocessing/filter_outliers.py (1 hunks)
  • scripts/preprocessing/filter_patients.py (3 hunks)
  • src/MEDS_polars_functions/code_metadata.py (9 hunks)
  • src/MEDS_polars_functions/filter_measurements.py (1 hunks)
  • tests/test_extraction.py (1 hunks)
Files not reviewed due to errors (3)
  • scripts/preprocessing/filter_codes.py (no review received)
  • scripts/preprocessing/filter_outliers.py (no review received)
  • scripts/preprocessing/add_time_derived_measurements.py (no review received)
Files skipped from review due to trivial changes (1)
  • scripts/preprocessing/filter_patients.py
Additional context used
Ruff
src/MEDS_polars_functions/filter_measurements.py

8-8: Module level import not at top of file (E402)


119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)


207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)

tests/test_extraction.py

7-7: Module level import not at top of file (E402)


8-8: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


10-10: Module level import not at top of file (E402)


11-11: Module level import not at top of file (E402)


13-13: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)


15-15: Module level import not at top of file (E402)


330-330: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

Additional comments not posted (23)
configs/preprocess.yaml (4)

28-28: Configuration for age and time_of_day enhances pipeline flexibility and specificity.


36-40: Renaming and addition of do_summarize_over_all_codes align well with enhanced data aggregation needs.


46-52: Enhanced outlier detection configurations contribute to the robustness and accuracy of the data preprocessing pipeline.


55-55: Updated aggregations in fit_normalization are consistent with the overall improvements in data processing.

scripts/preprocessing/collect_code_metadata.py (6)

31-32: Check for consistent handling of paths and directory structures.

Also applies to: 34-34

Verification successful

The search results indicate that the Path object from the pathlib module is used consistently across various files to handle paths and directory structures. The handling of paths appears to be uniform and follows a standard approach.

  • Consistent Usage: The Path object is imported and used consistently across multiple files to define input and output directories.
  • Directory Structures: The directory structures are handled uniformly with the Path object, ensuring consistency in path manipulations.

No inconsistencies were found in the handling of paths and directory structures.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Ensure consistency in path handling.
rg --type python --glob '*.py' 'Path'

Length of output: 68



Script:

#!/bin/bash
# Description: Ensure consistency in path handling by searching for 'Path' usage in Python files.
rg 'Path' --glob '*.py'

Length of output: 7182


19-19: Ensure the correct configuration file is referenced.

Verification successful

The configuration path appears to be consistently referenced across multiple scripts. Specifically, the config_path is set to "../../configs" for various preprocessing and extraction scripts. This indicates that the configuration path is correctly updated and referenced.

  • scripts/preprocessing/collect_code_metadata.py:@hydra.main(version_base=None, config_path="../../configs", config_name="preprocess")

Other scripts also follow the same pattern, ensuring consistency.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --type python --glob '*.py' '@hydra.main'

Length of output: 75



Script:

#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --glob '*.py' '@hydra.main'

Length of output: 1539


79-79: Confirm that the output file path and name are correctly handled.

Verification successful

The output from the shell script confirms that the file path and name code_metadata.parquet are used consistently across multiple files in the codebase. The relevant occurrences are:

  • verify_parquet.py
  • tests/test_extraction.py
  • scripts/preprocessing/filter_outliers.py
  • scripts/preprocessing/collect_code_metadata.py
  • scripts/preprocessing/filter_codes.py
  • scripts/extraction/collect_code_metadata.py

This indicates that the output file path and name are correctly handled.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the output file path and name are correct.
rg --type python --glob '*.py' 'code_metadata.parquet'

Length of output: 85



Script:

#!/bin/bash
# Description: Verify the output file path and name are correct.
rg --type py 'code_metadata.parquet'

Length of output: 699


39-39: Verify the correct configuration parameters are passed to mapper_fntr.

Verification successful

The search results show that mapper_fntr is consistently called with two parameters, typically stage_cfg and a second parameter that varies (e.g., None, ListConfig(code_modifier_columns), code_modifier_columns). This suggests that the correct configuration parameters are being passed as per different contexts.

  • src/MEDS_polars_functions/code_metadata.py
  • scripts/extraction/collect_code_metadata.py
  • scripts/preprocessing/collect_code_metadata.py
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type python --glob '*.py' 'mapper_fntr'

Length of output: 75



Script:

#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type py --glob '*.py' 'mapper_fntr'

Length of output: 1344


47-47: Ensure output paths are handled consistently and correctly.

Verification successful

The output paths are handled consistently across the codebase with appropriate directory variables (output_dir, metadata_input_dir, and MEDS_cohort_dir).

  • verify_parquet.py: Direct assignment to file_path.
  • tests/test_extraction.py: Assignment to output_file using MEDS_cohort_dir.
  • scripts/extraction/collect_code_metadata.py and scripts/preprocessing/collect_code_metadata.py: Using output_dir.
  • scripts/preprocessing/filter_codes.py and scripts/preprocessing/filter_outliers.py: Using metadata_input_dir.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that output paths are consistently handled.
rg --type python --glob '*.py' '/ "code_metadata.parquet"'

Length of output: 91



Script:

#!/bin/bash
# Description: Verify that output paths are consistently handled.
rg --glob '*.py' 'code_metadata.parquet'

Length of output: 703



Script:

#!/bin/bash
# Extract relevant lines from the files identified by the previous search
rg 'code_metadata.parquet' verify_parquet.py tests/test_extraction.py scripts/extraction/collect_code_metadata.py scripts/preprocessing/filter_codes.py scripts/preprocessing/collect_code_metadata.py scripts/preprocessing/filter_outliers.py

Length of output: 902


76-76: Check the configuration parameters passed to reducer_fntr for consistency.

Verification successful

The parameters passed to reducer_fntr in both scripts/extraction/collect_code_metadata.py and scripts/preprocessing/collect_code_metadata.py are consistent with its definition. Specifically, both files pass cfg.stage_cfg and cfg.get("code_modifier_columns", None), which align with the parameters stage_cfg and code_modifier_columns expected by reducer_fntr.

  • scripts/preprocessing/collect_code_metadata.py: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
  • scripts/extraction/collect_code_metadata.py: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg --type python --glob '*.py' 'reducer_fntr'

Length of output: 76



Script:

#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg -t py 'reducer_fntr'

Length of output: 1032



Script:

#!/bin/bash
# Description: Retrieve the definition of `reducer_fntr` and its usages for detailed analysis.
ast-grep --lang python --pattern $'def reducer_fntr($_, $_):\n    $$$' 

Length of output: 70



Script:

#!/bin/bash
# Description: Retrieve the definition of `reducer_fntr` and its usages for detailed analysis.
ast-grep --lang python --pattern 'def reducer_fntr'

Length of output: 398



Script:

#!/bin/bash
# Description: Search for the definition of `reducer_fntr` in the specific file.
rg 'def reducer_fntr' src/MEDS_polars_functions/code_metadata.py

Length of output: 84



Script:

#!/bin/bash
# Description: Extract the full definition of `reducer_fntr` by capturing multiple lines from its definition.
rg -A 10 'def reducer_fntr' src/MEDS_polars_functions/code_metadata.py

Length of output: 800



Script:

#!/bin/bash
# Description: Verify the parameters passed to `reducer_fntr` in the specified files.
rg 'reducer_fntr' scripts/extraction/collect_code_metadata.py scripts/preprocessing/collect_code_metadata.py

Length of output: 608

scripts/extraction/collect_code_metadata.py (6)

80-80: Confirm that the output file path and name are correctly handled.

Verification successful

The output file path and name for code_metadata.parquet appear to be correctly handled in multiple instances across the codebase, including the file scripts/extraction/collect_code_metadata.py.

  • scripts/extraction/collect_code_metadata.py: write_lazyframe(reduced, output_dir / "code_metadata.parquet")
  • scripts/preprocessing/collect_code_metadata.py: write_lazyframe(reduced, output_dir / "code_metadata.parquet")
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the output file path and name are correct.
rg --type python --glob '*.py' 'code_metadata.parquet'

Length of output: 85



Script:

#!/bin/bash
# Description: Verify the output file path and name are correct.
rg 'code_metadata.parquet' --glob '*.py'

Length of output: 703


19-19: Ensure the correct configuration file is referenced.

Verification successful

The @hydra.main decorator in scripts/extraction/collect_code_metadata.py correctly references the ../../configs path for the extraction configuration, consistent with other scripts in the extraction folder. No issues found.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --type python --glob '*.py' '@hydra.main'

Length of output: 75



Script:

#!/bin/bash
# Description: Verify the configuration path is correctly updated across the project.
rg --glob '*.py' '@hydra.main'

Length of output: 1539


40-40: Verify the correct configuration parameters are passed to mapper_fntr.

Verification successful

The mapper_fntr function is consistently called with stage_cfg and either None or code_modifier_columns across different scripts. This indicates that the configuration parameters passed to mapper_fntr in scripts/extraction/collect_code_metadata.py are correct and consistent with other usages.

  • src/MEDS_polars_functions/code_metadata.py:

    • mapper_fntr(stage_cfg, None)
    • mapper_fntr(stage_cfg, ListConfig(code_modifier_columns))
    • mapper_fntr(stage_cfg, code_modifier_columns)
  • scripts/preprocessing/collect_code_metadata.py:

    • mapper_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
  • scripts/extraction/collect_code_metadata.py:

    • mapper_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type python --glob '*.py' 'mapper_fntr'

Length of output: 75



Script:

#!/bin/bash
# Description: Check if `mapper_fntr` is used consistently with the same parameters across different scripts.
rg --type py 'mapper_fntr'

Length of output: 1330


48-48: Ensure output paths are handled consistently and correctly.


31-33: Check for consistent handling of paths and directory structures.

Also applies to: 35-35

Verification successful

The search results show consistent usage of Path from the pathlib module across various scripts. The pattern of converting configuration paths into Path objects appears to be uniformly applied. Here are some examples:

  • tests/test_extraction.py
  • src/MEDS_polars_functions/utils.py
  • scripts/preprocessing/filter_patients.py
  • scripts/preprocessing/filter_outliers.py
  • scripts/preprocessing/filter_codes.py
  • scripts/preprocessing/collect_code_metadata.py
  • scripts/extraction/collect_code_metadata.py
  • scripts/extraction/split_and_shard_patients.py
  • scripts/extraction/shard_events.py
  • scripts/extraction/merge_to_MEDS_cohort.py
  • scripts/extraction/convert_to_sharded_events.py
  • eICU_Example/pre_MEDS.py
  • MIMIC-IV_Example/pre_MEDS.py

Given the extensive and consistent usage of Path across different modules and scripts, it appears that the handling of paths and directory structures is consistent.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Ensure consistency in path handling.
rg --type python --glob '*.py' 'Path'

Length of output: 68



Script:

#!/bin/bash
# Description: Ensure consistency in path handling.
rg --type py --glob '*.py' 'Path'

Length of output: 7192


77-77: Check the configuration parameters passed to reducer_fntr for consistency.

Verification successful

The reducer_fntr function in src/MEDS_polars_functions/code_metadata.py expects two parameters: stage_cfg and code_modifier_columns. Both instances where reducer_fntr is called in scripts/preprocessing/collect_code_metadata.py and scripts/extraction/collect_code_metadata.py pass these parameters as expected.

  • scripts/preprocessing/collect_code_metadata.py
    • Line: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))
  • scripts/extraction/collect_code_metadata.py
    • Line: reducer_fn = reducer_fntr(cfg.stage_cfg, cfg.get("code_modifier_columns", None))

Both calls are consistent with the definition of reducer_fntr.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg --type python --glob '*.py' 'reducer_fntr'

Length of output: 76



Script:

#!/bin/bash
# Description: Ensure that `reducer_fntr` is consistently used across the project.
rg --type py --glob '*.py' 'reducer_fntr'

Length of output: 1050



Script:

#!/bin/bash
# Description: Extract the definition of `reducer_fntr`.
ast-grep --lang python --pattern $'def reducer_fntr($_, $_):\n    $$$' --json

# Extract the code snippets where `reducer_fntr` is called.
rg --type py --glob '*.py' 'reducer_fntr('

Length of output: 202



Script:

#!/bin/bash
# Description: Extract the definition of `reducer_fntr`.
ast-grep --lang python --pattern $'def reducer_fntr($_, $_):\n    $$$'

# Extract the code snippets where `reducer_fntr` is called.
rg --type py --glob '*.py' 'reducer_fntr\('

Length of output: 886



Script:

#!/bin/bash
# Display the content of `src/MEDS_polars_functions/code_metadata.py` to manually review the `reducer_fntr` definition.
cat src/MEDS_polars_functions/code_metadata.py

Length of output: 30089

configs/extraction.yaml (2)

19-20: Confirm that the new stage collect_code_metadata and its settings are appropriately configured.

Also applies to: 27-27

Verification successful

The collect_code_metadata stage is properly defined with a description and specific aggregations. This configuration appears consistent with other stages in the YAML file.

  • configs/extraction.yaml:
    • Line 19-20: code_modifiers: null
    • Line 27: collect_code_metadata stage configuration
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Ensure that the new stage settings are consistent with other stages.
rg --type yaml --glob '*.yaml' 'collect_code_metadata'

Length of output: 272



Script:

#!/bin/bash
# Extract the full configuration of the `collect_code_metadata` stage from the YAML file.
rg -A 10 'collect_code_metadata:' configs/extraction.yaml

Length of output: 619


74-89: Check the configuration settings for the collect_code_metadata stage for correctness and completeness.

Verification successful

The collect_code_metadata stage in the configs/extraction.yaml file seems to be well-defined and complete with necessary fields such as description, aggregations, do_summarize_over_all_codes, is_metadata, mapper_output_dir, and output_dir.

  • description provides a detailed explanation of the stage.
  • aggregations lists the specific aggregations to compute.
  • do_summarize_over_all_codes is set to true, indicating that overall, code-independent counts should be included.
  • is_metadata is set to true.
  • mapper_output_dir and output_dir specify the directories for output.

No issues were found with the configuration.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the settings for `collect_code_metadata` are complete and correct.
rg --type yaml --glob '*.yaml' 'collect_code_metadata'

Length of output: 272



Script:

#!/bin/bash
# Description: List all fields under the `collect_code_metadata` stage in the `configs/extraction.yaml` file for verification.
awk '/collect_code_metadata:/,/^[^ ]/' configs/extraction.yaml

Length of output: 864

tests/test_extraction.py (1)

441-453: Ensure the output file existence is verified properly.

The code correctly checks for the existence of the output file 'code_metadata.parquet'. However, it would be prudent to also verify the contents of this file to ensure that the data meets expected conditions. Consider adding a check that reads the file and validates some expected metadata properties.
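
A hedged sketch of such a content check (MEDS_cohort_dir as defined elsewhere in this test file; any columns beyond "code" are schema-dependent assumptions):

import polars as pl

code_metadata = pl.read_parquet(MEDS_cohort_dir / "code_metadata.parquet")

# Minimal integrity checks; extend with per-aggregation column assertions.
assert code_metadata.height > 0, "code metadata should not be empty"
assert "code" in code_metadata.columns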

src/MEDS_polars_functions/code_metadata.py (4)

9-9: Ensure the added imports are used effectively throughout the codebase.


122-195: Updated function validate_args_and_get_code_cols to handle new configuration structures and validate them against the METADATA_FN enum.

  • The function now properly raises exceptions for incorrect configurations, which enhances robustness.
  • The use of OmegaConf for configuration parsing is a good practice in projects where configurations might vary significantly and need to be managed centrally.

Line range hint 196-382: The mapper_fntr function has been updated to handle dynamic aggregation configurations:

  • The function now dynamically constructs aggregation operations based on the configuration, which adds flexibility to the data processing pipeline.
  • The use of pattern matching (match-case) for handling code_modifier_columns is a modern Python feature that improves code readability and maintainability.
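
For illustration, a sketch of that match-case dispatch; the function and variable names here are assumptions, not the repository's exact code:

from omegaconf import ListConfig

def resolve_code_columns(code_modifier_columns) -> list[str]:
    # Dispatch on the configured type of code_modifier_columns.
    match code_modifier_columns:
        case None:
            return ["code"]
        case str() as col:
            return ["code", col]
        case ListConfig() | list() as cols:
            return ["code", *cols]
        case _:
            raise TypeError(f"Unsupported type: {type(code_modifier_columns)}")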

Line range hint 386-531: The reducer_fntr function has been enhanced to merge data from multiple frames based on the configuration:

  • The function now uses a more dynamic approach to handle different aggregation types, which allows for more flexible data reduction strategies.
  • The error handling for missing columns during the reduction process is robust, ensuring that the function fails gracefully with an informative error message if expected columns are missing.

Resolved review threads: configs/preprocess.yaml (×2), src/MEDS_polars_functions/filter_measurements.py (outdated)
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 8

Outside diff range and nitpick comments (12)
scripts/preprocessing/tensorize.py (1)

20-20: Please provide a detailed description of the main function's purpose or remove the TODO if it's no longer relevant.

scripts/preprocessing/filter_outliers.py (1)

19-19: Please provide a detailed description of the main function's purpose or remove the TODO if it's no longer relevant.

scripts/preprocessing/fit_vocabulary_indices.py (1)

19-19: Please provide a detailed description of the main function's purpose or remove the TODO if it's no longer relevant.

scripts/preprocessing/normalize.py (1)

18-28: Consider providing a detailed function description in the docstring.

Adding a detailed docstring will improve the maintainability of the code by providing more context on what the function does and how it uses its parameters.

scripts/preprocessing/tokenization.py (1)

20-30: Consider adding a detailed function description in the docstring.

A detailed docstring would provide clarity on the function's purpose, parameters, and expected behavior, enhancing code maintainability.

src/MEDS_polars_functions/tensorize.py (1)

10-25: Enhance the function's documentation to include more detailed examples and error handling scenarios.

Expanding the documentation will help users understand the function's capabilities and limitations better, especially in handling different data formats and error scenarios.

MIMIC-IV_Example/joint_script_slurm.sh (1)

Tip: Codebase Verification

The review comment suggested ensuring that the slurm scripts handle failures gracefully and that parameters are correctly passed to the sub-scripts. The shell script analysis revealed the following:

  • Parameter Usage: Parameters such as N_PARALLEL_WORKERS, MIMICIV_PREMEDS_DIR, MIMICIV_MEDS_DIR, and event_conversion_config_fp are correctly used in the sub-scripts.
  • Error Handling: There is minimal explicit error handling (try, except, catch, trap) in the sub-scripts. This indicates that the scripts may not handle failures as gracefully as expected.

To improve robustness, it is recommended to add explicit error handling in the sub-scripts to manage potential failures effectively.

  • Sub-scripts to review:
    • scripts/extraction/convert_to_sharded_events.py
    • scripts/extraction/merge_to_MEDS_cohort.py
    • scripts/extraction/shard_events.py
    • scripts/extraction/split_and_shard_patients.py

Here are the actionable points:

  • Add error handling mechanisms in the sub-scripts to ensure they handle failures gracefully.
  • Review and potentially update the error handling logic in the identified sub-scripts.
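
As a starting point for the first bullet, a hedged Python sketch of a failure guard around a sub-script entry point (run_stage and the loguru logger are assumptions):

import sys

from loguru import logger

def run_stage(cfg) -> None:
    ...  # hypothetical stage body

def guarded_main(cfg) -> None:
    # Exit non-zero on any failure so Slurm and parallel drivers observe it.
    try:
        run_stage(cfg)
    except Exception:
        logger.exception("Stage failed; exiting with a non-zero status.")
        sys.exit(1)
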
Analysis chain

Line range hint 47-112: Ensure that the slurm scripts handle failures gracefully and that the parameters are correctly passed to the sub-scripts.

Also, check that the parameters like N_PARALLEL_WORKERS and directory paths are correctly used in the sub-scripts.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the robustness of error handling in the slurm scripts and the correctness of parameter passing.

# Test: Simulate a failure in one of the sub-scripts and check if it's handled gracefully. Expect: Graceful handling of failures.
./scripts/extraction/test_failure_simulation.sh

Length of output: 172



Script:

#!/bin/bash
# Find the sub-scripts mentioned in the main script
fd -e py -e sh

# Check the usage of parameters and error handling in the sub-scripts
for script in $(fd -e py -e sh); do
  echo "Inspecting: $script"
  echo "Parameter usage:"
  rg 'N_PARALLEL_WORKERS|MIMICIV_PREMEDS_DIR|MIMICIV_MEDS_DIR|event_conversion_config_fp' "$script"
  echo "Error handling:"
  rg 'try|except|catch|trap' "$script"
done

Length of output: 24247

src/MEDS_polars_functions/utils.py (1)

Line range hint 114-114: Optimize dictionary key existence check.

-    for s in stage_configs.keys():
+    for s in stage_configs:
MIMIC-IV_Example/README.md (2)

Line range hint 192-192: Consider replacing "lots of" with a more precise term.

Consider using "many" instead of "lots of" for a more formal tone in documentation.


Line range hint 199-199: Add possessive apostrophe to "patients".

- …deathtimes between the hosp table and the patients table?
+ …deathtimes between the hosp table and the patients' table?
README.md (1)

358-382: Consider documenting or eventually supporting the "In-memory Training" and "Fixed-batch Retrieval" methods, as these could be beneficial for specific use cases.

Tools
LanguageTool

[formatting] ~365-~365: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE)
Context: ...d basis. This mode is extremely scalable, because the entire dataset never need be loaded...


[formatting] ~368-~368: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE)
Context: ...ze. This mode is also extremely flexible, because different cohorts can be loaded from th...

src/MEDS_polars_functions/code_metadata.py (1)

Line range hint 386-530: The reducer_fntr function is well-structured and includes detailed error handling and input validation. It effectively uses the aggregation operations defined in CODE_METADATA_AGGREGATIONS. The dynamic renaming of columns to handle multiple dataframes in the reduction process is a clever use of Polars' capabilities. However, add a check to ensure stage_cfg is of the correct type.

+ if not isinstance(stage_cfg, DictConfig):
+     raise TypeError("Expected stage_cfg to be of type DictConfig")
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed between 401fbbe and b7d3c4c, relative to the base of the PR.

Files selected for processing (25)
  • MIMIC-IV_Example/README.md (4 hunks)
  • MIMIC-IV_Example/joint_script_slurm.sh (2 hunks)
  • README.md (1 hunks)
  • configs/pipeline.yaml (1 hunks)
  • configs/preprocess.yaml (1 hunks)
  • preprocessing_operation_prototypes.md (1 hunks)
  • pyproject.toml (1 hunks)
  • scripts/extraction/collect_code_metadata.py (1 hunks)
  • scripts/preprocessing/add_time_derived_measurements.py (2 hunks)
  • scripts/preprocessing/collect_code_metadata.py (5 hunks)
  • scripts/preprocessing/filter_codes.py (1 hunks)
  • scripts/preprocessing/filter_outliers.py (1 hunks)
  • scripts/preprocessing/filter_patients.py (2 hunks)
  • scripts/preprocessing/fit_vocabulary_indices.py (1 hunks)
  • scripts/preprocessing/normalize.py (1 hunks)
  • scripts/preprocessing/tensorize.py (1 hunks)
  • scripts/preprocessing/tokenization.py (1 hunks)
  • src/MEDS_polars_functions/code_metadata.py (9 hunks)
  • src/MEDS_polars_functions/filter_measurements.py (1 hunks)
  • src/MEDS_polars_functions/get_vocabulary.py (1 hunks)
  • src/MEDS_polars_functions/normalization.py (5 hunks)
  • src/MEDS_polars_functions/tensorize.py (1 hunks)
  • src/MEDS_polars_functions/tokenize.py (1 hunks)
  • src/MEDS_polars_functions/utils.py (1 hunks)
  • terminology.md (1 hunks)
Files skipped from review due to trivial changes (2)
  • configs/pipeline.yaml
  • pyproject.toml
Files skipped from review as they are similar to previous changes (6)
  • configs/preprocess.yaml
  • scripts/extraction/collect_code_metadata.py
  • scripts/preprocessing/add_time_derived_measurements.py
  • scripts/preprocessing/collect_code_metadata.py
  • scripts/preprocessing/filter_codes.py
  • scripts/preprocessing/filter_patients.py
Additional context used
Markdownlint
terminology.md

3-3: Expected: h2; Actual: h4 (MD001, heading-increment)
Heading levels should only increment by one level at a time

MIMIC-IV_Example/README.md

183-183: Punctuation: ':' (MD026, no-trailing-punctuation)
Trailing punctuation in heading


43-43: null (MD034, no-bare-urls)
Bare URL used

preprocessing_operation_prototypes.md

42-42: Expected: h4; Actual: h5 (MD001, heading-increment)
Heading levels should only increment by one level at a time


70-70: Expected: h4; Actual: h5 (MD001, heading-increment)
Heading levels should only increment by one level at a time


24-24: Expected: underscore; Actual: asterisk (MD049, emphasis-style)
Emphasis style


24-24: Expected: underscore; Actual: asterisk (MD049, emphasis-style)
Emphasis style

README.md

254-254: Punctuation: ':' (MD026, no-trailing-punctuation)
Trailing punctuation in heading


417-417: Punctuation: ':' (MD026, no-trailing-punctuation)
Trailing punctuation in heading

Ruff
src/MEDS_polars_functions/utils.py

114-114: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

src/MEDS_polars_functions/filter_measurements.py

8-8: Module level import not at top of file (E402)


119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)


207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)

LanguageTool
MIMIC-IV_Example/README.md

[uncategorized] ~87-~87: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...ry we'll denote as $MIMICIV_MEDS_DIR. Note this is a different directory than the ...


[style] ~192-~192: The phrase ‘lots of’ might be wordy and overused. Consider using an alternative. (A_LOT_OF)
Context: ...etimeevents5.icu/ingredientevents` Lots of questions remain about how to appropria...


[uncategorized] ~199-~199: It seems likely that a singular genitive (’s) apostrophe is missing. (AI_HYDRA_LEO_APOSTROPHE_S_XS)
Context: ...athtimes between the hosp table and the patients table? 2. How to handle the dob nonsens...

preprocessing_operation_prototypes.md

[style] ~7-~7: The phrase “a variety of” may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF)
Context: ...that can be applied to MEDS datasets in a variety of circumstances to accomplish diverse, ye...


[style] ~32-~32: Consider replacing this phrase with the adverb “consistently” to avoid wordiness. (IN_A_X_MANNER)
Context: ...ts being merged into the output dataset in a consistent manner. Currently, these capabilities are only...


[uncategorized] ~45-~45: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...olumns should not be modified in this step as that will break the linkage with ...


[uncategorized] ~56-~56: A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Context: ...paradigm is not yet implemented. ##### Currently Supported Operations Functions: 1. As...


[style] ~80-~80: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ... patient data filters should be applied prior to aggregation. 2. What aggregation functi...


[style] ~84-~84: Consider shortening this phrase to just ‘whether’, unless you mean ‘regardless of whether’. (WHETHER)
Context: ...nto a single, unified metadata file. 3. Whether or not aggregation functions should be compute...


[style] ~89-~89: ‘prior to’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_PRIOR_TO)
Context: ... lacks support for patient-data filters prior to aggregation. ##### Currently supported...


[uncategorized] ~91-~91: A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Context: ...ta filters prior to aggregation. ##### Currently supported operations Patient Filters: ...


[style] ~120-~120: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ...move all data corresponding to patients on the basis of the resulting criteria. 3. Return the f...


[uncategorized] ~136-~136: A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Context: ...s/filter_patients_by_length.py`. ##### Currently supported operations 1. Filtering pati...


[uncategorized] ~168-~168: A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Context: ...h extension relatively natively. ##### Currently supported operations Currently, measur...


[style] ~170-~170: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ...Currently, measurements can be filtered on the basis of min_patients_per_code and `min_occurr...


[uncategorized] ~197-~197: Use a comma before “and” if it connects two independent clauses (unless they are closely connected and short). (COMMA_COMPOUND_SENTENCE_2)
Context: ...ion that the indicator columns are added and this function is not reversible. #####...


[uncategorized] ~219-~219: A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Context: ... is not yet a general prototype. ##### Currently supported operations 1. Occluding nume...

README.md

[grammar] ~9-~9: “Google” is a proper noun and needs to be capitalized. (A_GOOGLE)
Context: ...e the "Roadmap" section below and [this google doc](https://docs.google.com/document/d...


[grammar] ~98-~98: The modal verb ‘will’ requires the verb’s base form. (MD_BASEFORM)
Context: ...worker jobs. Note this will typically required a distributed file system to work corre...


[uncategorized] ~102-~102: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...ficant speedup depending on the machine configuration as sharding ensures that parallelism ca...


[style] ~109-~109: As an alternative to the over-used intensifier ‘extremely’, consider replacing this phrase. (EN_WEAK_ADJECTIVE)
Context: ...idual stage of the pipeline can be kept extremely simple (often each stage corresponds to a sing...


[style] ~112-~112: The phrase “a variety of” may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF)
Context: ...parallelization) use of the pipeline in a variety of environments, all without imposing sign...


[style] ~157-~157: You have already used this phrasing in nearby sentences. Consider replacing it to add variety to your writing. (REP_NEED_TO_VB)
Context: ...s and external ontologies? Will they need to specify select key-event concepts to id...


[uncategorized] ~183-~183: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
Context: ...achine or across many worker nodes on a cluster provided the worker nodes have acces...


[typographical] ~211-~211: Consider adding a comma here. (FOR_NOW_COMMA)
Context: ...the other valid nested MEDS format, but for now we leave that up to the user. #### Inp...


[style] ~275-~275: The phrase “a variety of” may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF)
Context: ...r downstream use. This package contains a variety of pre-processing transformations and scri...


[style] ~282-~282: This phrase is redundant. Consider using “outside”. (OUTSIDE_OF)
Context: ... - Removing numerical values that fall outside of pre-specified, per-code ranges (e.g., f...


[style] ~291-~291: Consider using “who” when you are referring to a person instead of an object. (THAT_WHO)
Context: ..." event to the dataset for each patient that occurs at the end of the observation ...


[style] ~300-~300: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. (ENGLISH_WORD_REPEAT_BEGINNING_RULE)
Context: ...s each code occurs in the dataset. - Computing appropriate numerical bins for each cod...


[uncategorized] ~304-~304: Possible missing preposition found. (AI_HYDRA_LEO_MISSING_TO)
Context: ...sure that the only columns that need be processed going into the pre-processing...


[formatting] ~365-~365: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE)
Context: ...d basis. This mode is extremely scalable, because the entire dataset never need be loaded...


[formatting] ~368-~368: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE)
Context: ...ze. This mode is also extremely flexible, because different cohorts can be loaded from th...


[grammar] ~417-~417: It appears that a hyphen is missing in the plural noun “to-dos”? (TO_DO_HYPHEN)
Context: ...bmitit cluster parallelism support. ## TODOs: 1. We need to have a vehicle to clean...


[grammar] ~422-~422: If ‘kind’ is a classification term, ‘a’ is not necessary. Use “kind of”. (The phrases ‘kind of’ and ‘sort of’ are informal if they mean ‘to some extent’.) (KIND_OF_A)
Context: ... right prior files. Likely need some kind of a "prior stage name" config variable.

Additional comments not posted (26)
scripts/preprocessing/normalize.py (1)

39-58: Validate the use of pl.read_parquet with use_pyarrow=True.

Also, ensure that rwlock_wrap function handles the pl.scan_parquet and write_lazyframe correctly, especially with the do_overwrite configuration.
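
For reference, a minimal sketch of the do_overwrite guard such a check would validate; this is illustrative only, and rwlock_wrap presumably layers file locking on top of logic like this:

from pathlib import Path

import polars as pl

def write_with_overwrite_guard(df: pl.DataFrame, out_fp: Path, do_overwrite: bool) -> None:
    # Refuse to clobber existing output unless explicitly allowed.
    if out_fp.exists() and not do_overwrite:
        raise FileExistsError(f"{out_fp} exists and do_overwrite is False!")
    df.write_parquet(out_fp, use_pyarrow=True)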

scripts/preprocessing/tokenization.py (1)

40-69: Ensure consistent handling of file paths and error handling in the tokenization process.

Also, verify that rwlock_wrap correctly handles pl.scan_parquet and write_lazyframe functions, especially considering the do_overwrite configuration.

src/MEDS_polars_functions/normalization.py (1)

7-7: Ensure the code_modifiers parameter is used correctly to filter the MEDS dataset.

Verification successful

The code_modifiers parameter is indeed used correctly in the normalize function to filter the MEDS dataset. The parameter is checked for a None value, initialized if necessary, and then utilized in the join operation and column selection.

  • Initialization: If code_modifiers is None, it is set to an empty list.
  • Column Selection: code_modifiers is added to the list of columns to select from code_metadata.
  • Join Operation: The join operation uses code_modifiers to join the df and code_metadata dataframes.

Additionally, the code_modifiers parameter is referenced in other related functions and scripts, ensuring its consistent usage across the codebase.

  • src/MEDS_polars_functions/normalization.py

    • Line 7: Parameter definition
    • Lines 41-42: Handling None value
    • Line 44: Column selection
    • Line 56: Join operation
  • src/MEDS_polars_functions/get_vocabulary.py

    • Various functions: validate_code_metadata, lexicographic_indices
  • scripts/preprocessing/normalize.py

    • Line 14: Configuration retrieval
    • Line 15: Partial function application
  • scripts/preprocessing/fit_vocabulary_indices.py

    • Lines 14-17: Configuration retrieval and initialization
    • Line 19: Function application
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the usage of the `code_modifiers` parameter in the normalization function.
rg --type python 'code_modifiers' --context 5

Length of output: 76



Script:

#!/bin/bash
# Verify the usage of the `code_modifiers` parameter in the normalization function.
rg --type py 'code_modifiers' --context 5

Length of output: 16822

src/MEDS_polars_functions/utils.py (1)

153-155: Add type checking for is_metadata to prevent runtime errors.

Proper type validation ensures robustness and prevents unexpected behavior during runtime.
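
A minimal sketch of the suggested guard, assuming is_metadata is read from an untyped OmegaConf config:

from omegaconf import DictConfig

def check_is_metadata(stage_cfg: DictConfig) -> bool:
    # Guard against misconfigured values (e.g., the string "true" in YAML).
    is_metadata = stage_cfg.get("is_metadata", False)
    if not isinstance(is_metadata, bool):
        raise TypeError(f"Expected a bool for is_metadata; got {type(is_metadata).__name__}")
    return is_metadata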

src/MEDS_polars_functions/tokenize.py (2)

90-92: Ensure proper separation of static and dynamic data based on timestamp.

The logic correctly separates data into static and dynamic parts based on the presence of a timestamp.
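
That separation amounts to a null check on the timestamp column; a minimal Polars sketch, assuming MEDS-style column names:

import polars as pl

df = pl.scan_parquet("shard.parquet")  # hypothetical input shard

static_df = df.filter(pl.col("timestamp").is_null())       # static measurements
dynamic_df = df.filter(pl.col("timestamp").is_not_null())  # time-stamped events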


139-151: Verify the correct joining of static and dynamic dataframes.

Verification successful

The full implementation of the extract_statics_and_schema function has been successfully retrieved. The function joins static_by_patient and schema_by_patient dataframes on the patient_id column using an inner join.

  • The static_by_patient dataframe is grouped by patient_id and aggregates the code and numerical_value columns.
  • The schema_by_patient dataframe is grouped by patient_id and aggregates the minimum timestamp as start_time and collects unique timestamp values.

The join operation appears to be correctly implemented, joining on the patient_id column with an inner join.
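
Continuing the sketch above, an illustrative reconstruction of the described aggregation and join (not the repository's exact code):

static_by_patient = static_df.group_by("patient_id").agg(
    pl.col("code"), pl.col("numerical_value")
)
schema_by_patient = dynamic_df.group_by("patient_id").agg(
    pl.col("timestamp").min().alias("start_time"),
    pl.col("timestamp").unique(maintain_order=True),
)
statics_and_schema = static_by_patient.join(schema_by_patient, on="patient_id", how="inner")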

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the correct joining of static and dynamic dataframes in `extract_statics_and_schema`.
rg --type python 'join' --context 5 src/MEDS_polars_functions/tokenize.py

Length of output: 104



Script:

#!/bin/bash
# Locate the function `extract_statics_and_schema` and extract its contents.
ast-grep --lang python --pattern $'def extract_statics_and_schema($_):\n    $$$' src/MEDS_polars_functions/tokenize.py

Length of output: 118



Script:

#!/bin/bash
# Locate the function `extract_statics_and_schema` in the file `src/MEDS_polars_functions/tokenize.py`.
rg 'def extract_statics_and_schema' -A 20 src/MEDS_polars_functions/tokenize.py

Length of output: 1061



Script:

#!/bin/bash
# Locate the complete function `extract_statics_and_schema` in the file `src/MEDS_polars_functions/tokenize.py`.
rg 'def extract_statics_and_schema' -A 50 src/MEDS_polars_functions/tokenize.py

Length of output: 2628



Script:

#!/bin/bash
# Locate the complete function `extract_statics_and_schema` in the file `src/MEDS_polars_functions/tokenize.py`.
rg 'def extract_statics_and_schema' -A 100 src/MEDS_polars_functions/tokenize.py

Length of output: 5110

MIMIC-IV_Example/README.md (1)

85-85: Clarify the expected output directory for the MEDS extraction ETL.

Verification successful

The README.md does provide information about the expected output directory for the MEDS extraction ETL. It mentions that the final MEDS dataset should be output into a directory denoted as $MIMICIV_MEDS_DIR.

  • Line 85: "We will assume you want to output the final MEDS dataset into a directory we'll denote as $MIMICIV_MEDS_DIR."

This clarification is present in the documentation, making the review comment unnecessary.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Verify the documentation consistency for the output directory in MEDS extraction ETL.
rg --type markdown 'output_dir' MIMIC-IV_Example/README.md

Length of output: 153



Script:

#!/bin/bash
# Verify mentions of the MEDS extraction ETL in the README.md to check for references to the output directory.
rg -A 5 'MEDS extraction ETL' MIMIC-IV_Example/README.md

Length of output: 940

src/MEDS_polars_functions/filter_measurements.py (3)

8-8: Move the import statement to the top of the file to adhere to PEP 8 guidelines.

Tools
Ruff

8-8: Module level import not at top of file (E402)


119-123: Replace f-string docstrings with regular docstrings to prevent Python from interpreting them as joined strings.

Tools
Ruff

119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)


207-211: Replace f-string docstrings with regular docstrings to prevent Python from interpreting them as joined strings.

Tools
Ruff

207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
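
For context, the B021 fix looks like this: an f-string in the docstring position is evaluated and discarded rather than stored as __doc__, so interpolated documentation must be attached explicitly. The names below are hypothetical:

import polars as pl

MIN_PATIENTS_PER_CODE = 10  # hypothetical interpolated constant

def filter_codes(df: pl.LazyFrame) -> pl.LazyFrame:
    """Filters out codes observed in too few patients."""
    return df  # placeholder body

# Assign __doc__ explicitly when values must be interpolated; an f-string
# literal in the docstring position triggers Ruff B021 and is never stored.
filter_codes.__doc__ = f"""Filters out codes observed in fewer than
{MIN_PATIENTS_PER_CODE} patients."""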

src/MEDS_polars_functions/get_vocabulary.py (3)

9-22: Introduced an enumeration VOCABULARY_ORDERING for defining vocabulary ordering methods. This is a clear and extensible way to manage different ordering strategies.


28-92: Added the validate_code_metadata function to ensure the integrity of code metadata. This function checks for the presence of required columns and the uniqueness of the code and modifiers, which is crucial for maintaining data consistency.


95-180: Introduced lexicographic_indices to assign vocabulary indices based on lexicographic order. This function is well-documented and includes thorough error handling and examples, which enhances maintainability and usability.
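
A minimal sketch of lexicographic index assignment (column names are assumptions; with_row_index requires a recent Polars, with_row_count being the older spelling):

import polars as pl

code_metadata = pl.DataFrame({"code": ["lab//HR", "dx//I10", "rx//ASA"]})

# Sort codes lexicographically and assign 1-based indices (offset=1 leaves
# index 0 free, e.g., for padding).
indexed = code_metadata.sort("code").with_row_index("code/vocab_index", offset=1)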

preprocessing_operation_prototypes.md (9)

1-1: The document header clearly states that the content is still in progress, setting the correct expectations for the reader.


6-12: The introduction provides a good overview of the purpose of operation prototypes. This helps in understanding the context and utility of the described operations.

Tools
LanguageTool

[style] ~7-~7: The phrase “a variety of” may be wordy. To make your writing clearer, consider replacing it. (A_VARIETY_OF)
Context: ...that can be applied to MEDS datasets in a variety of circumstances to accomplish diverse, ye...


14-16: Use of the term "code" is well-explained, which is important for clarity and avoiding confusion given its potential overlap with programming terminology.
[APPROVED]


18-25: The distinction between "removing data" and "occluding data" is clearly made, which is crucial for understanding the implications of different operations on data integrity and traceability.

Tools
Markdownlint

24-24: Expected: underscore; Actual: asterisk (MD049, emphasis-style)
Emphasis style


24-24: Expected: underscore; Actual: asterisk (MD049, emphasis-style)
Emphasis style


27-34: Describes the flexibility of filtering prototypes and their ability to be applied conditionally based on various criteria. However, it notes that these capabilities are not yet implemented, which is good for setting realistic expectations.

Tools
LanguageTool

[style] ~32-~32: Consider replacing this phrase with the adverb “consistently” to avoid wordiness. (IN_A_X_MANNER)
Context: ...ts being merged into the output dataset in a consistent manner. Currently, these capabilities are only...


37-41: Explains the "Transform Codes" operation, focusing on static remapping of codes. The warning against modifying code or code_modifier columns is crucial to maintain linkage with patient data.


64-77: Describes the "Collect metadata about code realizations in patient data" operation. It provides a clear step-by-step breakdown of the operation, enhancing understandability.

Tools
Markdownlint

70-70: Expected: h4; Actual: h5 (MD001, heading-increment)
Heading levels should only increment by one level at a time


106-122: Details different modes of filtering data from MEDS datasets. The step-by-step description of each mode helps in understanding the operational flow and potential applications.

Tools
LanguageTool

[style] ~120-~120: ‘on the basis of’ might be wordy. Consider a shorter alternative. (EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Context: ...move all data corresponding to patients on the basis of the resulting criteria. 3. Return the f...


180-234: Mentions planned operations for transforming measurements within events. This forward-looking statement helps in understanding the future scope of the project.

Tools
LanguageTool

[uncategorized] ~197-~197: Use a comma before “and” if it connects two independent clauses (unless they are closely connected and short). (COMMA_COMPOUND_SENTENCE_2)
Context: ...ion that the indicator columns are added and this function is not reversible. #####...


[uncategorized] ~219-~219: A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA)
Context: ... is not yet a general prototype. ##### Currently supported operations 1. Occluding nume...

README.md (2)

332-338: The description of the tokenization process is clear and well-articulated.


342-382: The tensorization section provides a comprehensive overview of the different training modes, emphasizing the benefits of Direct Retrieval, which is supported by this repository.

Tools
LanguageTool

[formatting] ~365-~365: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE)
Context: ...d basis. This mode is extremely scalable, because the entire dataset never need be loaded...


[formatting] ~368-~368: If the ‘because’ clause is essential to the meaning, do not use a comma before the clause. (COMMA_BEFORE_BECAUSE)
Context: ...ze. This mode is also extremely flexible, because different cohorts can be loaded from th...

src/MEDS_polars_functions/code_metadata.py (2)

9-9: Added imports from omegaconf are appropriate for the configuration management required in this module.


Line range hint 196-382: The function mapper_fntr is correctly implemented with comprehensive input validation and error handling, leveraging the previously defined validate_args_and_get_code_cols. The use of Python's pattern matching for code_modifier_columns is modern and aids in clarity.
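For readers unfamiliar with that style, the pattern-matching dispatch on code_modifier_columns presumably resembles the following sketch (not the PR's exact code):

match code_modifier_columns:
    case None:
        code_key_columns = ["code"]
    case [*modifiers] if all(isinstance(m, str) for m in modifiers):
        code_key_columns = ["code", *modifiers]
    case _:
        raise ValueError(f"Invalid code_modifier_columns: {code_modifier_columns!r}")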

Resolved review comments:
  • README.md
  • src/MEDS_polars_functions/code_metadata.py
  • terminology.md
  • src/MEDS_polars_functions/normalization.py (2 comments)
  • src/MEDS_polars_functions/tokenize.py
  • scripts/preprocessing/normalize.py
  • src/MEDS_polars_functions/tensorize.py
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b7d3c4c and e70d26f.

Files selected for processing (1)
  • tests/test_extraction.py (3 hunks)
Additional context used
Ruff
tests/test_extraction.py

7-7: Module level import not at top of file (E402)


8-8: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


10-10: Module level import not at top of file (E402)


11-11: Module level import not at top of file (E402)


13-13: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)


15-15: Module level import not at top of file (E402)


386-386: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

Additional comments not posted (3)
tests/test_extraction.py (3)

150-204: LGTM! The test data for various splits is well-defined and mirrors the expected structure of the MEDS outputs.


212-221: The updated SUB_SHARDED_OUTPUTS dictionary correctly maps new test data structures to their respective splits. This change ensures that the test suite can accurately validate the data processing across different stages.


228-230: The MEDS_OUTPUTS dictionary has been updated to reference the newly added data structures. This structure is crucial for ensuring that the output of the test cases can be validated against expected results.

Resolved review comment (outdated): tests/test_extraction.py
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Outside diff range and nitpick comments (2)
tests/test_extraction.py (2)

Line range hint 7-14: Consider moving module-level imports to the top of the file.

+ import json
+ import subprocess
+ import tempfile
+ from io import StringIO
+ from pathlib import Path
+ import polars as pl
+ from polars.testing import assert_frame_equal
+ pl.enable_string_cache()
Tools
Ruff

11-11: Module level import not at top of file (E402)


13-13: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)


Line range hint 401-401: Simplify the dictionary key check by removing .keys().

- if key in dict.keys()
+ if key in dict
Tools
Ruff

11-11: Module level import not at top of file (E402)


13-13: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e70d26f and 7150d5e.

Files selected for processing (1)
  • tests/test_extraction.py (5 hunks)
Additional context used
Ruff
tests/test_extraction.py

7-7: Module level import not at top of file (E402)


8-8: Module level import not at top of file (E402)


9-9: Module level import not at top of file (E402)


10-10: Module level import not at top of file (E402)


11-11: Module level import not at top of file (E402)


13-13: Module level import not at top of file (E402)


14-14: Module level import not at top of file (E402)


401-401: Use key in dict instead of key in dict.keys() (SIM118)

Remove .keys()

Additional comments not posted (1)
tests/test_extraction.py (1)

502-534: The assert_df_equal function is well-implemented with detailed output for debugging.
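A helper of that shape typically wraps polars.testing.assert_frame_equal and prints both frames on failure, e.g. (a sketch; the test file's actual signature may differ):

import polars as pl
from polars.testing import assert_frame_equal

def assert_df_equal(want: pl.DataFrame, got: pl.DataFrame, msg: str = "", **kwargs) -> None:
    """Sketch: compare frames, surfacing both on failure so CI logs are self-explanatory."""
    try:
        assert_frame_equal(want, got, **kwargs)
    except AssertionError as e:
        print(f"{msg}\nWant:\n{want}\nGot:\n{got}")
        raise AssertionError(f"{msg}: dataframes differ") from e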

Resolved review comment: tests/test_extraction.py
@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7150d5e and 410e6ce.

Files selected for processing (2)
  • .github/workflows/tests.yaml (1 hunks)
  • src/MEDS_polars_functions/filter_measurements.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • .github/workflows/tests.yaml
Additional context used
Ruff
src/MEDS_polars_functions/filter_measurements.py

119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)


207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)

Additional comments not posted (1)
src/MEDS_polars_functions/filter_measurements.py (1)

139-227: Review of filter_outliers_fntr function.

The filter_outliers_fntr function introduces a robust method for filtering out outliers based on standard deviations. This is a crucial step in data preprocessing to ensure the quality of data before further analysis. The function is implemented efficiently using Polars' lazy frames, which is suitable for handling large datasets. However, similar to the previous function, adding more inline comments could improve understanding and maintainability.

Tools
Ruff

207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
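In outline, the filtering likely reduces to a join against per-code statistics followed by a bounds check, as in this sketch (the statistic column names and the treatment of null values are assumptions):

import polars as pl

stddev_cutoff = 4.5  # illustrative; the real value comes from the stage config

stats = code_metadata.lazy().select(
    "code",
    pl.col("values/mean").alias("_mean"),
    pl.col("values/std").alias("_std"),
)

filtered = (
    df.lazy()
    .join(stats, on="code", how="left")
    .filter(
        pl.col("numeric_value").is_null()  # non-numeric events pass through untouched
        | ((pl.col("numeric_value") - pl.col("_mean")).abs() <= stddev_cutoff * pl.col("_std"))
    )
    .drop("_mean", "_std")
)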

Comment on lines +11 to +136
>>> fn(data).collect()
shape: (2, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ cat ┆ i64 │
╞════════════╪══════╪═══════════╡
│ 1 ┆ A ┆ 1 │
│ 1 ┆ B ┆ 1 │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_patients_per_code": 1, "min_occurrences_per_code": 4})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (2, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ cat ┆ i64 │
╞════════════╪══════╪═══════════╡
│ 1 ┆ A ┆ 1 │
│ 2 ┆ A ┆ 2 │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_patients_per_code": 1})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (4, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ cat ┆ i64 │
╞════════════╪══════╪═══════════╡
│ 1 ┆ A ┆ 1 │
│ 1 ┆ B ┆ 1 │
│ 2 ┆ A ┆ 2 │
│ 2 ┆ C ┆ 2 │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_patients_per_code": None, "min_occurrences_per_code": None})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (4, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ cat ┆ i64 │
╞════════════╪══════╪═══════════╡
│ 1 ┆ A ┆ 1 │
│ 1 ┆ B ┆ 1 │
│ 2 ┆ A ┆ 2 │
│ 2 ┆ C ┆ 2 │
└────────────┴──────┴───────────┘
>>> stage_cfg = DictConfig({"min_occurrences_per_code": 5})
>>> fn = filter_codes_fntr(stage_cfg, code_metadata_df, ["modifier1"])
>>> fn(data).collect()
shape: (1, 3)
┌────────────┬──────┬───────────┐
│ patient_id ┆ code ┆ modifier1 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ cat ┆ i64 │
╞════════════╪══════╪═══════════╡
│ 2 ┆ A ┆ 2 │
└────────────┴──────┴───────────┘
"""

min_patients_per_code = stage_cfg.get("min_patients_per_code", None)
min_occurrences_per_code = stage_cfg.get("min_occurrences_per_code", None)

filter_exprs = []
if min_patients_per_code is not None:
filter_exprs.append(pl.col("code/n_patients") >= min_patients_per_code)
if min_occurrences_per_code is not None:
filter_exprs.append(pl.col("code/n_occurrences") >= min_occurrences_per_code)

if not filter_exprs:
return lambda df: df

join_cols = ["code"]
if code_modifier_columns:
join_cols.extend(code_modifier_columns)

allowed_code_metadata = (code_metadata.filter(pl.all_horizontal(filter_exprs)).select(join_cols)).lazy()

def filter_codes_fn(df: pl.LazyFrame) -> pl.LazyFrame:
f"""Filters patient events to only encompass those with a set of permissible codes.

In particular, this function filters the DataFrame to only include (code, modifier) pairs that have
at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences.
"""

idx_col = "_row_idx"
while idx_col in df.columns:
idx_col = f"_{idx_col}"

return (
df.with_row_count(idx_col)
.join(allowed_code_metadata, on=join_cols, how="inner")
.sort(idx_col)
.drop(idx_col)
)

return filter_codes_fn
Contributor

Refactor suggestion for filter_codes_fntr function.

The function filter_codes_fntr is well-structured and follows good Pythonic practices. However, consider adding more detailed comments within the function to explain the logic behind each step, especially within the filtering conditions. This will enhance maintainability and readability for other developers who might work on this code later.

Tools
Ruff

119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)

Comment on lines +207 to +211
f"""Filters out outlier numerical values from patient events.

In particular, this function filters the DataFrame to only include numerical values that are within
{stddev_cutoff} standard deviations of the mean for the corresponding (code, modifier) pair.
"""
Contributor

Correct f-string usage in docstrings.

As with the previous function, replace the f-string in the docstring to avoid misinterpretations by Python.

- f"""Filters out outlier numerical values from patient events.
+ """Filters out outlier numerical values from patient events.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- f"""Filters out outlier numerical values from patient events.
+ """Filters out outlier numerical values from patient events.
  In particular, this function filters the DataFrame to only include numerical values that are within
  {stddev_cutoff} standard deviations of the mean for the corresponding (code, modifier) pair.
  """
Tools
Ruff

207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)
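The failure mode Ruff is pointing at in both comments is easy to demonstrate: an f-string at the top of a function body is an ordinary expression, so no docstring is attached at all.

def with_fstring(cutoff=4.5):
    f"""Filters values within {cutoff} standard deviations."""

def with_plain_string(cutoff=4.5):
    """Filters values within a configurable number of standard deviations."""

print(with_fstring.__doc__)       # None -- the f-string was evaluated and discarded
print(with_plain_string.__doc__)  # the docstring, as expected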

Comment on lines +119 to +123
f"""Filters patient events to only encompass those with a set of permissible codes.

In particular, this function filters the DataFrame to only include (code, modifier) pairs that have
at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences.
"""
Contributor

Replace f-string docstrings with regular docstrings.

The static analysis tool flagged the use of f-string as docstrings. Python interprets these as joined strings rather than docstrings, which could lead to unexpected behaviors or documentation issues.

- f"""Filters patient events to only encompass those with a set of permissible codes.
+ """Filters patient events to only encompass those with a set of permissible codes.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- f"""Filters patient events to only encompass those with a set of permissible codes.
+ """Filters patient events to only encompass those with a set of permissible codes.
  In particular, this function filters the DataFrame to only include (code, modifier) pairs that have
  at least {min_patients_per_code} patients and {min_occurrences_per_code} occurrences.
  """
Tools
Ruff

119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7150d5e and 410e6ce.

Files selected for processing (2)
  • .github/workflows/tests.yaml (1 hunks)
  • src/MEDS_polars_functions/filter_measurements.py (1 hunks)
Files not reviewed due to errors (1)
  • src/MEDS_polars_functions/filter_measurements.py (no review received)
Additional context used
Ruff
src/MEDS_polars_functions/filter_measurements.py

119-123: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)


207-211: f-string used as docstring. Python will interpret this as a joined string, rather than a docstring. (B021)

Additional comments not posted (1)
.github/workflows/tests.yaml (1)

36-36: Enhanced Testing Configuration Approved

The addition of the --doctest-modules and --cov=src flags to the pytest command is excellent for improving test coverage and ensuring that doctests are run. Ensure that the coverage paths are correctly configured to reflect the actual source directories.
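For reference, the resulting invocation is presumably along the lines of `pytest --doctest-modules --cov=src`; any flags beyond the two named here are assumptions.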

@mmcdermott mmcdermott merged commit e1e6260 into main Jun 24, 2024
2 checks passed